9 research outputs found
Crosslingual Document Embedding as Reduced-Rank Ridge Regression
There has recently been much interest in extending vector-based word
representations to multiple languages, such that words can be compared across
languages. In this paper, we shift the focus from words to documents and
introduce a method for embedding documents written in any language into a
single, language-independent vector space. For training, our approach leverages
a multilingual corpus where the same concept is covered in multiple languages
(but not necessarily via exact translations), such as Wikipedia. Our method,
Cr5 (Crosslingual reduced-rank ridge regression), starts by training a
ridge-regression-based classifier that uses language-specific bag-of-word
features in order to predict the concept that a given document is about. We
show that, when constraining the learned weight matrix to be of low rank, it
can be factored to obtain the desired mappings from language-specific
bags-of-words to language-independent embeddings. As opposed to most prior
methods, which use pretrained monolingual word vectors, postprocess them to
make them crosslingual, and finally average word vectors to obtain document
vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as
document-level. Moreover, since our algorithm uses the singular value
decomposition as its core operation, it is highly scalable. Experiments show
that our method achieves state-of-the-art performance on a crosslingual
document retrieval task. Finally, although not trained for embedding sentences
and words, it also achieves competitive performance on crosslingual sentence
and word retrieval tasks.Comment: In The Twelfth ACM International Conference on Web Search and Data
Mining (WSDM '19
Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction
Large language models (LLMs) have great potential for synthetic data
generation. This work shows that useful data can be synthetically generated
even for tasks that cannot be solved directly by LLMs: for problems with
structured outputs, it is possible to prompt an LLM to perform the task in the
reverse direction, by generating plausible input text for a target output
structure. Leveraging this asymmetry in task difficulty makes it possible to
produce large-scale, high-quality data for complex tasks. We demonstrate the
effectiveness of this approach on closed information extraction, where
collecting ground-truth data is challenging, and no satisfactory dataset exists
to date. We synthetically generate a dataset of 1.8M data points, establish its
superior quality compared to existing datasets in a human evaluation, and use
it to finetune small models (220M and 770M parameters), termed SynthIE, that
outperform the prior state of the art (with equal model size) by a substantial
margin of 57 absolute points in micro-F1 and 79 points in macro-F1. Code, data,
and models are available at https://github.com/epfl-dlab/SynthIE.Comment: Accepted at EMNLP 202
Scalable PAC-Bayesian Meta-Learning via the PAC-Optimal Hyper-Posterior: From Theory to Practice
Meta-Learning aims to speed up the learning process on new tasks by acquiring
useful inductive biases from datasets of related learning tasks. While, in
practice, the number of related tasks available is often small, most of the
existing approaches assume an abundance of tasks; making them unrealistic and
prone to overfitting. A central question in the meta-learning literature is how
to regularize to ensure generalization to unseen tasks. In this work, we
provide a theoretical analysis using the PAC-Bayesian theory and present a
generalization bound for meta-learning, which was first derived by Rothfuss et
al. (2021). Crucially, the bound allows us to derive the closed form of the
optimal hyper-posterior, referred to as PACOH, which leads to the best
performance guarantees. We provide a theoretical analysis and empirical case
study under which conditions and to what extent these guarantees for
meta-learning improve upon PAC-Bayesian per-task learning bounds. The
closed-form PACOH inspires a practical meta-learning approach that avoids the
reliance on bi-level optimization, giving rise to a stochastic optimization
problem that is amenable to standard variational methods that scale well. Our
experiments show that, when instantiating the PACOH with Gaussian processes and
Bayesian Neural Networks models, the resulting methods are more scalable, and
yield state-of-the-art performance, both in terms of predictive accuracy and
the quality of uncertainty estimates.Comment: 61 pages. arXiv admin note: text overlap with arXiv:2002.0555
Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science
Large Language Models (LLMs) have democratized synthetic data generation,
which in turn has the potential to simplify and broaden a wide gamut of NLP
tasks. Here, we tackle a pervasive problem in synthetic data generation: its
generative distribution often differs from the distribution of real-world data
researchers care about (in other words, it is unfaithful). In a case study on
sarcasm detection, we study three strategies to increase the faithfulness of
synthetic data: grounding, filtering, and taxonomy-based generation. We
evaluate these strategies using the performance of classifiers trained with
generated synthetic data on real-world data. While all three strategies improve
the performance of classifiers, we find that grounding works best for the task
at hand. As synthetic data generation plays an ever-increasing role in NLP
research, we expect this work to be a stepping stone in improving its utility.
We conclude this paper with some recommendations on how to generate
high(er)-fidelity synthetic data for specific tasks.Comment: 8 page
Flows: Building Blocks of Reasoning and Collaborating AI
Recent advances in artificial intelligence (AI) have produced highly capable
and controllable systems. This creates unprecedented opportunities for
structured reasoning as well as collaboration among multiple AI systems and
humans. To fully realize this potential, it is essential to develop a
principled way of designing and studying such structured interactions. For this
purpose, we introduce the conceptual framework of Flows: a systematic approach
to modeling complex interactions. Flows are self-contained building blocks of
computation, with an isolated state, communicating through a standardized
message-based interface. This modular design allows Flows to be recursively
composed into arbitrarily nested interactions, with a substantial reduction of
complexity. Crucially, any interaction can be implemented using this framework,
including prior work on AI--AI and human--AI interactions, prompt engineering
schemes, and tool augmentation. We demonstrate the potential of Flows on the
task of competitive coding, a challenging task on which even GPT-4 struggles.
Our results suggest that structured reasoning and collaboration substantially
improve generalization, with AI-only Flows adding + and human--AI Flows
adding + absolute points in terms of solve rate. To support rapid and
rigorous research, we introduce the aiFlows library. The library comes with a
repository of Flows that can be easily used, extended, and composed into novel,
more complex Flows.
The aiFlows library is available at https://github.com/epfl-dlab/aiflows.
Data and Flows for reproducing our experiments are available at
https://github.com/epfl-dlab/cc_flows
PAC-Bayesian Meta-Learning: From Theory to Practice
Meta-Learning aims to accelerate the learning on new tasks by acquiring useful inductive biases from related data sources. In practice, the number of tasks available for meta-learning is often small. Yet, most of the existing approaches rely on an abundance of meta-training tasks, making them prone to overfitting. How to regularize the meta-learner to ensure generalization to unseen tasks, is a central question in the literature. We provide a theoretical analysis using the PAC-Bayesian framework and derive the first bound for meta-learners with unbounded loss functions. Crucially, our bounds allow us to derive the PAC-optimal hyper-posterior (PACOH) - the closed-form-solution of the PAC-Bayesian meta-learning problem, thereby avoiding the reliance on nested optimization, giving rise to an optimization problem amenable to standard variational methods that scale well. Our experiments show that, when instantiating the PACOH with Gaussian processes and Bayesian Neural Networks as base learners, the resulting methods are more scalable, and yield state-of-the-art performance, both in terms of predictive accuracy and the quality of uncertainty estimates. Finally, thanks to the principled treatment of uncertainty, our meta-learners can also be successfully employed for sequential decision problems
PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees
Meta-learning can successfully acquire useful inductive biases from data. Yet, its generalization properties to unseen learning tasks are poorly understood. Particularly if the number of meta-training tasks is small, this raises concerns about overfitting. We provide a theoretical analysis using the PAC-Bayesian framework and derive novel generalization bounds for meta-learning. Using these bounds, we develop a class of PAC-optimal meta-learning algorithms with performance guarantees and a principled meta-level regularization. Unlike previous PAC-Bayesian meta-learners, our method results in a standard stochastic optimization problem which can be solved efficiently and scales well. When instantiating our PAC-optimal hyper-posterior (PACOH) with Gaussian processes and Bayesian Neural Networks as base learners, the resulting methods yield state-of-the-art performance, both in terms of predictive accuracy and the quality of uncertainty estimates. Thanks to their principled treatment of uncertainty, our meta-learners can also be successfully employed for sequential decision problems.ISSN:2640-349